Abstract

The enzyme glyceraldehyde-3-phosphate dehydrogenase (GAPDH) catalyses a key step of glycolysis,
the decomposition of glucose. The gene that produces GAPDH is therefore present in a wide class
of organisms. We show that for this gene the average value of the fluctuations in nucleotide
distribution in the codons, normalised to strand bias, provides a reasonable measure of how the
gene has evolved in time.

Key words: GAPDH - evolution - 4-dimensional walk model - evolutionary marker - persistent
diffusion - random sequence normalised to strand bias

Introduction

Evolution makes lower organisms into higher ones. The distribution of the nucleotides in the genes
that code for proteins undergoes changes in the process. It is sometimes assumed that these variations
in the nucleotide distributions come about through random mutations. In this work we present
quantitative evidence that the changes in the bases of the GAPDH gene are remarkably well ordered.

The DNA sequence that codes for a single protein evolves as we go from one organism to the
next. The evolution of the base composition of A, T, G and C for the same protein is the key to the
dynamics of biological evolution. Some proteins are restricted to a few organisms; others are more
common. Amongst these proteins and enzymes, glyceraldehyde-3-phosphate dehydrogenase (GAPDH) is
present in all living organisms as the key enzyme in glycolysis, the pathway common to organisms
that live in free oxygen and those that do not. GAPDH catalyzes the dehydrogenation
and phosphorylation of glyceraldehyde-3-phosphate to form 1,3-bisphosphoglycerate.

The nature of the base organisation of DNA sequences has been studied in recent years
(Voss 1992; Li and Kaneko 1992; Peng et al. 1992). Fractal correlations of the 1/f^\beta type have been
reported. These fractal correlations are more pronounced for the introns and the intergenic flanks.
The exons, on the other hand, are characterised by a strong peak at f = 1/3 in the power spectrum.
Here we work only with the exon regions and attempt to isolate the physical quantity that provides
insights into the nature of evolution in the GAPDH gene.

With this in mind we pick the DNA sequences coding for the GAPDH enzyme from a wide
variety of prokaryotes, which include both bacteria and archaea (Woese et al. 1990), and eukaryotes
(organisms with nucleated cells). The bacteria, in our study, are further subdivided into three groups:
proteobacteria, the Bacillus/Clostridium group and cyanobacteria. Owing to the paucity of data for
archaeal GAPDH, we cannot subdivide the archaea; we compare them as a whole with the groups of
bacteria under Prokaryota.

Zuckerkandl and Pauling (1965) laid the basis for the study of genes and proteins for evolution.
Over the years there has been a search for the universal common ancestor (Volkenstein 1994;
Doolittle and Brown 1994; Woese 1998; Doolittle 1999; Woese 2000; Doolittle 2000) that may have
preceded the prokaryotes and the eukaryotes. Studies on ribosomal RNA provided some of
the insights (Woese and Fox 1977a, 1977b; Fox et al. 1980). The relative importance of the elements
that drive the evolution of species, such as mutations and lateral gene transfer (Krishnapillai 1996;
Brown and Doolittle 1997; Jain et al. 1999; Ochman 2000), continues to be under active investigation.
In our work here with GAPDH we try to isolate the physical quantity (called X) that measures
the evolution in this gene.

Number Fluctuations

The coding sequences of the GAPDH genes from 42 different species, 31 eukaryotes and 11
prokaryotes, were chosen (Source: GenBank and EMBL nucleotide sequence databases). These
sequences have different distributions of the bases A, T, G and C. Since codons are made of 3 of
these bases, we divide each sequence into codons, i.e. we choose a window 3 bases long.

On these windows of size 3, we compute the squares of the numbers of A, T, G and C and define
N(3) as:

N(3) = n_A^2(3) + n_T^2(3) + n_G^2(3) + n_C^2(3)

where n_A(3), n_T(3), n_G(3), n_C(3) are the numbers of A, T, G and C respectively in the codon window of
size 3. Thus if, for instance, A occurs in all the three positions we get N(3) = 9. If two are identical we
get N(3) = 4 + 1 = 5. If all the positions are occupied by different nucleotides, we get N(3) = 1 + 1 + 1 = 3.
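
The three cases above can be checked directly. The following is a minimal Python sketch, not the authors' program; the function name codon_scores and the toy sequence are illustrative assumptions, and incomplete trailing windows are simply ignored.

```python
from collections import Counter


def codon_scores(seq, window=3):
    """Return N(window) for each complete, non-overlapping window of seq.

    N is the sum of squared base counts inside the window, as defined above.
    Trailing bases that do not fill a whole window are ignored (an assumption).
    """
    seq = seq.upper()
    scores = []
    for start in range(0, len(seq) - window + 1, window):
        counts = Counter(seq[start:start + window])
        scores.append(sum(n * n for n in counts.values()))
    return scores


# Toy sequence (hypothetical): codons AAA, AAT and ATG, one of each kind.
print(codon_scores("AAAAATATG"))  # [9, 5, 3]
```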

Thus N(3), for the window size 3, takes the values 3, 5 or 9 as we go from one codon to the next along
the gene. We then compute the average value of N(3), call it < N(3) >, over the sequence. We
notice here that a high value of < N(3) > implies repeats of the bases. This means a persistent sort of
correlation amongst the bases. In other words, a higher value of < N(3) > implies a higher probability
that an A, for instance, is going to be followed by an A. Conversely, a lower value of < N(3) >
implies an antipersistent order in the sequence, leading to a lower probability for an A to be followed
immediately by an A.

What do we expect for < N(3) > for a random sequence of identical strand bias? Strand bias is
the proportion of A, T, G and C in the sequence. These proportions vary as we go from one GAPDH
sequence to another. We want to isolate the effect above and beyond the strand bias and therefore
study the quantity X defined as:

X = \frac{\langle N(3) \rangle}{\langle N(3,r) \rangle}    (1)

where < N(3, r) > is the average value of the quantity N(3) for a random sequence of identical
total length and strand bias.

< N(3) > is measured for the sequences, while < N(3, r) > is calculated using a 4-dimensional
walk (Montroll and West 1979; Montroll and Shlesinger 1984) model. Hence the quantity X is
obtained.
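
As an illustration of Eq. (1), the sketch below computes X for a single coding sequence. It is a hedged example rather than the authors' program: the function name x_value and the toy input are assumptions, and the random-sequence average is taken from the closed form that the text derives as Eq. (6), < N(3, r) > = m[(m - 1) \sum_i p_i^2 + 1], with p_i the base fractions of the sequence itself.

```python
from collections import Counter


def x_value(seq, m=3):
    """Ratio of the observed mean N(m) to its strand-bias-matched random expectation."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % m          # drop an incomplete final window
    windows = [seq[i:i + m] for i in range(0, usable, m)]
    if not windows:
        raise ValueError("sequence shorter than one window")

    # Observed average of N(m): sum of squared base counts per window.
    observed = sum(
        sum(n * n for n in Counter(w).values()) for w in windows
    ) / len(windows)

    # Strand bias p_i, measured on the same bases that were windowed.
    counts = Counter(seq[:usable])
    sum_p2 = sum((n / usable) ** 2 for n in counts.values())

    # Random-walk expectation, Eq. (6): m[(m - 1) * sum(p_i^2) + 1].
    expected = m * ((m - 1) * sum_p2 + 1)
    return observed / expected


# Toy input (hypothetical, not a GAPDH sequence).
print(round(x_value("ATGGCCAAGGTTGGTATC"), 4))
```

Values of X above 1 then indicate more within-codon repeats than the base composition alone would give, and values below 1 indicate fewer.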

To calculate < N(3, r) > consider the following walk model in 4 dimensions corresponding to A,
T, G and C. If we encounter the symbol i (i = A, T, G or C) we move one step along i. In this
directed walk the probability function for a single step clearly is:

P_1(\mathbf{x}) = \sum_i p_i \, \delta(x_i - 1)    (2)

where \mathbf{x} = (x_A, x_T, x_G, x_C) and p_i = n_i / N; n_i is the number of times the symbol i appears
in the sequence; N is the total number of symbols, i.e. the length of the sequence. We want to get the
distribution after m steps and therefore define the characteristic function of the single step:

P_1(\mathbf{k}) = \sum_i p_i \, e^{\mathrm{i} k_i}    (3)

Here \mathrm{i} in the exponent denotes the imaginary unit, distinct from the base index i. For m steps:

P_m(\mathbf{k}) = \Big( \sum_i p_i \, e^{\mathrm{i} k_i} \Big)^m    (4)

The quantity m is clearly the total number of steps, i.e. the window size. The moments of the
distribution may be obtained by differentiating P_m(\mathbf{k}) with respect to \mathbf{k}. In particular < N(3, r) >
is just the second moment of the distribution and is obtained from P_m(\mathbf{k}):

\langle N(3,r) \rangle = - \sum_i \frac{\partial^2}{\partial k_i^2} P_m(\mathbf{k}) \Big|_{\mathbf{k}=0}    (5)

Using (4) and (5), we get:

\langle N(3,r) \rangle = m \Big[ (m-1) \sum_i p_i^2 + 1 \Big]    (6)

where we have used the relation \sum_i p_i = 1.
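
The step from (4) and (5) to (6) can be spelled out as follows. This is a sketch of the intermediate algebra, assuming the single-step characteristic function is \sum_i p_i e^{\mathrm{i} k_i}, so that each second derivative carries a factor \mathrm{i}^2 = -1 which the minus sign in (5) cancels; the shorthand S(\mathbf{k}) is introduced here only for brevity.

```latex
\begin{align*}
S(\mathbf{k}) &\equiv \sum_i p_i \, e^{\mathrm{i} k_i}, \qquad
P_m(\mathbf{k}) = S(\mathbf{k})^m, \qquad
S(0) = \sum_i p_i = 1, \\
\frac{\partial^2 P_m}{\partial k_j^2}
  &= m(m-1)\, S^{m-2} \left( \mathrm{i}\, p_j e^{\mathrm{i} k_j} \right)^2
   + m\, S^{m-1} \left( \mathrm{i}^2 p_j e^{\mathrm{i} k_j} \right), \\
\left. \frac{\partial^2 P_m}{\partial k_j^2} \right|_{\mathbf{k}=0}
  &= -\,m(m-1)\, p_j^2 \;-\; m\, p_j, \\
\langle N(3,r) \rangle
  &= -\sum_j \left. \frac{\partial^2 P_m}{\partial k_j^2} \right|_{\mathbf{k}=0}
   = m(m-1) \sum_j p_j^2 + m \sum_j p_j
   = m \Big[ (m-1) \sum_j p_j^2 + 1 \Big].
\end{align*}
```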

To crosscheck this relation, let us first set p_A = 1; p_T = p_G = p_C = 0. This is the case of maximal
persistence. All the three bases, in this limit, are identical. From (6), we find:

\langle N(3,r) \rangle = 9    (7)

as we expect.

To check again, set p_A = p_T = p_G = p_C = 1/4. The average value, from (6), gives:

\langle N(3,r) \rangle = 4.5    (8)

For the window size m = 3 the possible choices consistent with p_A = p_T = p_G = p_C = 1/4 are
4 x 4 x 4 = 64, namely, the 61 sense codons + 3 stop codons. Calculation of < N(3, r) > for these 64
combinations is straightforward and gives the value 4.5, in agreement with (8).
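
The enumeration over the 64 combinations can be reproduced in a few lines. This is an illustrative Python sketch (variable names are our own): it scores every possible codon, tallies how often each value of N(3) occurs, and compares the plain average with the closed form m[(m - 1) \sum_i p_i^2 + 1] evaluated at p_i = 1/4.

```python
from collections import Counter
from itertools import product

BASES = "ATGC"

# N(3) for each of the 4 x 4 x 4 = 64 possible codons.
scores = [
    sum(n * n for n in Counter(codon).values())
    for codon in product(BASES, repeat=3)
]

tally = Counter(scores)            # how many codons give each value of N(3)
mean_n = sum(scores) / len(scores)

# Closed form of Eq. (6) with m = 3 and p_i = 1/4 for every base.
m = 3
closed_form = m * ((m - 1) * 4 * 0.25 ** 2 + 1)

print(sorted(tally.items()))  # [(3, 24), (5, 36), (9, 4)]
print(mean_n, closed_form)    # 4.5 4.5
```

The tally shows where the average comes from: 24 codons with three different bases, 36 with exactly two identical bases, and 4 with all three identical, so (24 x 3 + 36 x 5 + 4 x 9) / 64 = 4.5.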

Nucleotide Sequence Comparison

The pairwise sequence alignment tool (ALIGN at the Genestream network server) available in the 
public domain gives a measure of the "distance" (or the cross correlations) between the sequences. 
These distances provide additional data towards the study of evolution in the GAPDH gene. 

In the usual studies of evolution and phylogeny one relies exclusively on nucleotide sequence 
comparison. The rules used for alignment of sequences are constructed to give rise to the known 
pattern. 

In contrast, the change in the value of X appears to us as the physical quantity of interest
in the evolution of the GAPDH gene. We use the nucleotide sequence comparison in this work as
supplementary, supportive data.

The X of Evolution 

The X values for the eukaryotes and the prokaryotes, for the GAPDH, for window size of 3, are given 
in Table 1. 

Interestingly, Table 1 suggests two parallel lines of evolution, one for the prokaryotes and the
other for the eukaryotes. Note that the value of X for the cyanobacterial genes is close to that for
the amphibian gene. The values for the Bacillus/Clostridium group and archaea are more or less the
same as those for fish and higher invertebrates such as arthropods.

As we look separately amongst the prokaryotes and the eukaryotes the X values increase as
follows:

Prokaryota: proteobacteria < archaea < Bacillus/Clostridium group < cyanobacteria
Eukaryota: fungus < invertebrate < fish < amphibia < bird < mammal (excl. human) < human

It is to be remembered that in arriving at this increasing pattern the average value of X over
the members of each group has been considered. Within each group there are variations in X (see
Table 1).

Assume now that the GAPDH gene began from a common universal ancestor. The route diverged to
give proteobacteria on one side and the fungal and invertebrate genes on the other. The proteobacterial
gene develops further into three genes: archaeal, Bacillus/Clostridium group and cyanobacterial.
The other trail from the fungus goes through fish, amphibia, probably reptilia (for which the data are
unavailable), birds and other mammals to reach its peak in humans.

Some groups have hypothesized that the eukaryotic species originated as the archaeal (e.g.
Thermoplasma-like organisms) and the bacterial (e.g. Spirochaeta-like organisms) cells merged in
anaerobic symbiosis, and that the GAPDH gene was contributed by the bacterial partner (Martin et al.
1993; Margulis 1996). Our results do not disprove this assumption. The X value averaged over all
members of bacteria (i.e. proteobacteria + Bacillus/Clostridium group + cyanobacteria) is
0.9662 (±0.028), which is close to the X values for the invertebrates and the fungi (Table 1).

Sequence Comparison 

The pairwise alignment tool gives a measure of similarity, or distance, between the various GAPDH 
genes under consideration (Figure 1). 

The results are fairly consistent with the picture that emerges from the study of X. They suggest
that the eukaryotic GAPDH genes might have originated from some eubacterial genes (Martin et al.
1993; Margulis 1996).

The alignment tool also suggests that both archaea and cyanobacteria may be quite distant from 
all other groups (Hensel et al. 1989; Arcari et al. 1993). As we measure the sequence similarity of 
the archaeal and the cyanobacterial genes with genes from the other two prokaryotic groups, we find 
the Bacillus/ Clostridium group gene closer to them than the proteobacterial one. This too supports 
the view obtained from the X values of the prokaryotes. 



The X Evolution of the GAPDH Exon 

The plot of X for eukaryotes against their approximate period of origin in the geological time scale
(Table 2) gives a fairly linear fit. We try a fit of the form y = Kx + c. For the slope K for the
eukaryotes we get:

K_{euk} = \frac{\Delta X}{\Delta T} = 1.1 \times 10^{-4} \; (\pm 0.2 \times 10^{-4}) \; (\mathrm{myr})^{-1}    (9)

where myr = million years. The computed \chi^2 value is 0.00009 with 6 degrees of freedom.
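
The fit of the form y = Kx + c can be sketched with the group averages of Table 1 and the dates of Table 2. The code below is an ordinary, unweighted least-squares fit written for illustration; whether the original fit weighted the points by their deviations is not stated, so this is an assumption, and the helper name linear_fit is our own.

```python
# Group-averaged X (Table 1) and period of origin in myr BP (Table 2).
eukaryotes = {
    "fungus": (570.0, 0.9623),
    "invertebrate": (510.0, 0.9677),
    "fish": (439.0, 0.9819),
    "amphibia": (363.0, 1.0098),
    "bird": (146.0, 1.0102),
    "mammal (excl. human)": (66.4, 1.0234),
    "human": (1.64, 1.0301),
}


def linear_fit(points):
    """Ordinary (unweighted) least-squares fit y = K*x + c; returns (K, c)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = linear_fit(list(eukaryotes.values()))
# Time is measured backwards (myr BP), so the fitted slope is negative;
# its magnitude is the rate at which X grows per million years.
print(f"|K| = {abs(slope):.2e} per myr, intercept = {intercept:.4f}")
```

With these inputs the magnitude of the slope comes out of order 10^-4 per myr, in line with the value quoted in (9).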
The earliest lifeforms are thought to have come about around 3500 million years before present (myr BP).
For the present we presume them to be the proteobacterial ones. If the slope of the prokaryotic GAPDH
gene X-evolution is assumed close to that for the eukaryotes, (9), then the cyanobacteria must have arisen

\Delta T = K_{euk}^{-1} \, [X_{cyano} - X_{proteo}] = 493.5 \; (\pm 126.6) \; \mathrm{myr}    (10)

after the proteobacteria. In myr BP this is 3500 - [493.5 (±126.6)] = 3006.5 (±126.6). Similarly, the
periods of origin of the Bacillus/Clostridium group and the archaea may be arrived at; they are given in
Table 3 and Figure 2.
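
Equation (10) is simple arithmetic on the Table 1 averages, and the sketch below applies it to the three prokaryotic groups. It is illustrative only. One assumption is flagged in the code: with the rounded slope 1.1 x 10^-4 of (9) the cyanobacterial interval comes out near 477 myr, whereas the quoted 493.5 myr corresponds to a slope of about 1.064 x 10^-4 (0.0525 / 493.5), which we take here to be the unrounded fitted value.

```python
# Group-averaged X values from Table 1.
X = {
    "proteobacteria": 0.9445,
    "archaea": 0.9892,
    "Bacillus/Clostridium group": 0.9896,
    "cyanobacteria": 0.9970,
}

T_PROTEO = 3500.0  # myr BP, presumed origin of the proteobacteria (from the text)


def origin_myr_bp(group, slope):
    """Eq. (10): time of origin = 3500 - (X_group - X_proteo) / slope."""
    delta_t = (X[group] - X["proteobacteria"]) / slope
    return T_PROTEO - delta_t


# ASSUMPTION: slope implied by the quoted 493.5 myr, i.e. 0.0525 / 493.5.
K_IMPLIED = 1.064e-4   # per myr
K_ROUNDED = 1.1e-4     # per myr, as printed in Eq. (9)

for group in ("archaea", "Bacillus/Clostridium group", "cyanobacteria"):
    print(f"{group:28s} {origin_myr_bp(group, K_IMPLIED):7.1f} "
          f"{origin_myr_bp(group, K_ROUNDED):7.1f}")
```

The first column of output uses the implied slope and lands within about one myr of the Table 3 entries; the second column shows how far the rounded slope shifts the same estimates.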

Fossil stromatolites are macroscopic structures produced by some species of cyanobacteria. These 
are believed to occur from the early Precambrian (i.e., 3000 myr BP) to the Recent period (Thain 
and Hickman 1994). This is in good agreement with (10) for the time of origin of cyanobacteria 
obtained from the X-evolution. 

For an alternate approach assume the cyanobacteria appeared around 3000 myr BP, and the
proteobacteria 3500 myr BP. The rate of change of X is then

K_{pro} = \frac{X_{cyano} - X_{proteo}}{\Delta T} = 1 \times 10^{-4} \; (\mathrm{myr\ BP})^{-1}    (11)

Thus the slope of the prokaryotic GAPDH gene X-evolution (11) comes out to be nearly identical to
that for the eukaryotes (9). Figure 3 shows the best linear fits for the prokaryotes and the eukaryotes,
which appear as two almost parallel lines.
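
The arithmetic behind (11) can be written out with the group averages from Table 1. The subscript labels are shorthand, and the final rounding to one significant figure is what yields the quoted value.

```latex
\[
\frac{X_{\mathrm{cyano}} - X_{\mathrm{proteo}}}{\Delta T}
  = \frac{0.9970 - 0.9445}{(3500 - 3000)\ \mathrm{myr}}
  = \frac{0.0525}{500\ \mathrm{myr}}
  = 1.05 \times 10^{-4}\ \mathrm{myr}^{-1}
  \approx 1 \times 10^{-4}\ \mathrm{myr}^{-1}.
\]
```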

Discussions

For the GAPDH exon the quantity X rises uniformly on two almost parallel paths, one for the
prokaryotes and the other for the eukaryotes. The uniformity of the rise in X with time implies that the
genetic evolution is well ordered and not the result of some random mutations.

The rise of X implies a trend towards persistent correlations in the base arrangement of
codons. That is, as we go up the ladder of evolution the probability that a nucleotide, for instance
A, is followed by another A increases. Note that the result is true for the window of size 3. Whether the
increase in persistence continues for any window size remains outside the scope of our analysis. The
increase in persistence in the window of size 3 gives a measure of the complexity of the sequences at
this scale (Roman-Roldan et al. 1998). Diffusive processes that have persistence have been studied
widely in recent years. For the GAPDH gene, suppose we work in the basis of purine-pyrimidine instead
of the full A, T, G and C. We find, amusingly, that the persistent nature of the diffusion increases even
more for the window of size 3. Going beyond GAPDH we find there are other important genes
that share these features.

For the archaea the sequence comparisons indicate that they are more or less equally distant from
the other prokaryotes and the eukaryotes. Yet the X-measure of the archaea places them between
the proteobacteria and the Bacillus/Clostridium group. The sequence information for the vertebrate
GAPDH genes, especially for the amphibia, is as of now limited. The availability of more data would
improve the results to a considerable extent.

The ordered, uniform X-evolution of the GAPDH exon allows us to estimate the times of origin
of the Bacillus/Clostridium group, cyanobacteria and archaea. The time of origin of cyanobacteria
falls near the previous estimates.

To conclude, the GAPDH gene is shown to be a marker for evolution. Importantly, the physical
quantity X, the second moment of the codon base distribution, normalised to the strand bias, bears
the footprint of a remarkably ordered evolution.

Figure Legends



Figure 1. Average % identity of nucleotide sequence in the GAPDH genes from different groups 
of organisms. The black lines and values imply the alignment results between the proteobacterial 
gene and the genes from all other groups; the pink lines and values for the Bacillus / Clostridium 
group gene with the other genes; the green lines and values between the archaeal gene and the other 
genes; and the blue lines and values for the cyanobacterial gene with the rest. 

Figure 2. The probable periods of origin of the prokaryotes (see Table 3), along with the periods
of origin of the eukaryotes (see Table 2), are plotted against the X values for the corresponding
GAPDH genes (see Table 1). The error bars simply indicate the standard deviation from the average
X values for the respective groups. Here the slope of the prokaryotic GAPDH gene X-evolution is
assumed to be equal to that for the eukaryotes.

Figure 3. The best linear fit-curves for both the prokaryotes and the eukaryotes, as we plot
the X values vs. the periods of origin. The solid black lines denote the best fit-curves. The slopes of
the GAPDH gene X-evolution for the prokaryotes and the eukaryotes are found to be close enough
to suggest two nearly parallel lines of evolution.

Table 1: The X values for prokaryotes and eukaryotes, along with the range of deviations
in the respective categories.

Category                       X

I. PROKARYOTA
proteobacteria                 0.9445 (±0.0127)
archaea                        0.9892 (±0.0075)
Bacillus/Clostridium group     0.9896 (±0.0126)
cyanobacteria                  0.9970 (±0.0110)

II. EUKARYOTA
fungus                         0.9623 (±0.0121)
invertebrate                   0.9677 (±0.0134)
fish                           0.9819 (±0.0097)
amphibia                       1.0098
bird                           1.0102 (±0.0021)
mammal (excl. human)           1.0234 (±0.0019)
human                          1.0301

Table 2: Origin of eukaryotes in geological time scale.

Category                       Position in time scale (myr BP)
                               (Stein and Rowe 1995; Pough et al. 1999)

Fungus                         570
Invertebrate                   510
Fish                           439
Amphibia                       363
Bird                           146
Mammal (excl. human)           66.4
Human                          1.64

Table 3: Probable origin of prokaryotes in geological time scale as emerged from their X
values.

Category                       Position in time scale (myr BP)

Proteobacteria                 3500
Archaea                        3079.5 (±108.2)
Bacillus/Clostridium group     3076.0 (±108.9)
Cyanobacteria                  3006.5 (±126.6)

This figure "figure1.jpeg" is available in "jpeg" format from:

http://arXiv.org/ps/physics/0006055v3